Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sanitization and validation for files #3

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

amirRamirfatahi
Copy link

No description provided.

Copy link
Contributor

@tipogi tipogi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about that comment? I cannot see anything related, is there any reason?

@tipogi tipogi linked an issue Dec 4, 2024 that may be closed by this pull request
@tipogi tipogi self-requested a review December 4, 2024 08:25
}

// Validate size
if self.size <= 0 || self.size > MAX_SIZE {
Copy link
Collaborator

@SHAcollision SHAcollision Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we taking the client declared size in /pubky.app/file/<object> at face value? Is there any way we can actually check the size of the blob? Maybe we should have a blob model with the validation that should be passed after we get the blob response in ingest() and before store_blob() in events/handlers/file.rs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the user has any incentive to lie about the size of the file.
Having said that, it looks close to the content-type issue. Our best check would be to make sure it matches the content-length header returned from the homeserver.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we can write into this specs crate? Should be as part of the /blob validation that probably has to run as a child validation of the /file one ?


Self {
name,
created_at: self.created_at,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we cannot use hash_id. We have a created_at timestamp anyway, therefore it is duplicated information. It's a straightforward method for the user to post 3 times the same .gif without creating 3 times the file on his homeserver or the indexer.

Alternatively, we have name field, that could as well be the file and blob name instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the emphasis on hash_id for? I understand its use case for something like tags, but we don't need that here.
The use case you're mentioning is not solved by having the same hash_id. With the current way, a sane user can create one blob, one file and then use that one file anywhere he wants. Should we stop the ability of someone to create multiple blobs and multiple files? I don't think so. I think it's important to remember we're a proxy here and people can do whatever they want with their homeservers, so while we enable, we shouldn't really look into ways of prohibiting some use cases, no matter how fringe they might sound to us, when there's no problem with having them.

Copy link
Collaborator

@SHAcollision SHAcollision Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not necessarily saying hash_id should go on the /file object only, I am talking about the obvious big benefits of hash_id for the data blob. Are the /blobs and the way Nexus uses them covered by the specs?

There is no use for the timestamp ids.

Consider what happens when in pubky-app I am a shit poster and I reply to 10 users by dropping into the post editor modal the same animated gif to make fun of them.

people can do whatever they want with their homeservers

This is barely an argument as the specs are created with the explicit intention of restricting the way data is written into homeservers into a common set of rules that allow interoperability between social pubky-app clients and indexers. A user can write whatever data he wants in whatever spec breaking way he wants and simply share a URI.

when there's no problem with having them

I think free storage saving for anyone using pubky-app according to specs is really good for such a small change (just hash_id for blobs). Might seem a stupid optimization this early but changing IDs and schemas post-launch will be harder.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For blob ids it does make sense to have hash_ids.

// validate content type
match Mime::from_str(&self.content_type) {
Ok(mime) => {
if !VALID_MIME_TYPES.contains(&mime.essence_str()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declared mime type could be pdf and blob be a exe , right? On click on the browser from pubky-app, the user will download a .exe ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. The user doesn't really have an incentive to lie about this.
Ultimately of course, we can have some security checker that runs the files through some malware detector, but mind you, something like this needs to be implemented on the homeserver first. The same goes for content type and sniffing the actual file type of the blob.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user doesn't really have an incentive to lie about this.

Can you elaborate?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying for normal users there's no need to lie about the content type of their files.
For malicious intent, you don't have to lie about the content type. IMO, If there are to be checks for this, it should be on the homeserver, not here.

src/file.rs Outdated
const MIN_NAME_LENGTH: usize = 1;
const MAX_NAME_LENGTH: usize = 255;
const MAX_SRC_LENGTH: usize = 1024;
const MAX_SIZE: i64 = 10_000_000; // 10 MB
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using decimal MB?

To avoid confusion maybe best to do 2^20 B

const MAX_SIZE: i64 = 10 * (1 << 20);

or 10_485_760

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feat: files validation, sanitization, docs.
3 participants